Large-scale Reverse Engineering by the Lasso 
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We perform a reverse engineering from the "extended Spellman data", consisting of 6178 mRNA 
levels measured by microarrays at 73 instances in four time series during the cell cycle of the 
yeast Saccharomyces cerevisae. By assuming a linear model of the genetic regulatory network, and 
imposing an extra constraint (the Lasso), we obtain a unique inference of coupling parameters. These 
parameters are transfered into an adjacent matrix, which is analyzed with respect to topological 
properties and biological relevance. We find a very broad distribution of outdegrees in the network, 
compatible with earlier findings for biological systems and totally incompatible with a random graph, 
and also indications of modules in the network. Finally, we show there is an excess of genes coding 
for transcription factors among the genes of highest outdegrees, a fact which indicates that our 
approach has biological relevance. 
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Advances in microarray technologies make it today 
possible to measure mRNA-levels for thousands of genes 
simultaneously. Also, large-scale measurements of pro- 
tein levels are gradually becoming feasible, as well as 
results on two-hybrid measurements on protein-protein 
interactions and genome- wide data for DNA binding pro- 
teins. These processes have emphasized the need for com- 
putational biology in order to get as much information 
out of such measurements as possible. 

One way to handle these data is to infer, or reverse en- 
gineer, genetic regulatory networks from temporal data. 
Although still somewhat speculative, researchers are ex- 
ploring the boundaries for what kind of inference that is 
possible. There are many approaches for network forma- 
tion, ranging from Boolean circuits to very complicated 
non-linear spatial models, see and references therein. 
Most models use only transcript data, whereas some in- 
corporate other chemical constituents as well. A model 
based on mRNA-data only is nothing but an effective 
network of gene-to-gene interactions. This might look 
too simplistic in view of the complete network, which in- 
cludes metabolites, proteins, etc., but it can be thought 
of as a projection onto the space of genes only. By fo- 
cusing only on transcript data, the networks obtained are 
not biochemical regulatory networks, but phenomenolog- 
ical networks where many physical connections between 
macromolecules might be hidden by short-cuts, i.e., many 
intermediate units in regulatory cascades might be hid- 
den 0. 

A special class, which has gained some popularity, is 
the linear, continuous model. Of course, no one claims 
there is a linear relationship between the units in a real 
regulatory network. Instead, the working hypothesis is 
that linear equations can at least capture the main fea- 
tures of the network. The main argument is that many 
functions can be quite accurately approximated around 
a specific working point with their linearization. Thus, 
it can provide a good starting point for further consider- 



ations. 

A key problem for all models are, however, shortage of 
data. The number of genes is in general much larger than 
the number of measurements, and different authors have 
taken somewhat different avenues to remedy this obsta- 
cle. For the linear continuous model based on transcript 
data, the first study we are aware of was by D'haeseler 
et al. jsj and focused on a subset of less than hundred 
genes that were believed to be interrelated. Their prob- 
lem was still underdetermined, and they interpolated the 
data in order to achieve more, simulated, measurements. 
However, more measurements in the same time-series is 
an ineffective way of increasing the information content 
in the data 0]. Another early study was by vanSomeren 
et al. [^, who clustered genes into the same number 
of groups as they had measurements, and thus obtained 
a mathematically well-posed problem. Still another ap- 
proach was explored by Holter et al. 0, who formed 
networks among the principal components of the data. 
A more biologically motivated study was performed by 
Yeung et al. [7| , who assumed that the resulting network 
should be sparse and that way got a unique solution. 
Somewhat in the same spirit, vanSomeren et al. con- 
ducted a systematic study on how to incorporate prior 
knowledge into the inference procedure. Finally we men- 
tion the only large-scale inference we are aware of. It was 
conducted by Dewey and Galas 0, who considered the 
whole genome of yeast with more than 6000 genes, and 
formed the network by taking the solution which min- 
imized the L^-norm of the coefficients and set to zero 
all matrix elements below a certain threshold. However, 
the resulting network had connections only for 143 genes, 
and although they justify that their result makes biologi- 
cal sense on a small scale, they lack a large-scale analysis. 

In the present letter, we utilize one statistical method, 
the Lasso ^3 , to reverse engineer a network among ORFs 
("Open Reading Frames, hereafter referred to as "genes") 
in the so-called extended Spellman dataset. This dataset 
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is one of the most referenced sources of microarray data 
and contains measured mRNA levels of 6178 genes for the 
yeast S. cerevisae, presented as logarithms of the frac- 
tion between the measured level and a reference level. 
The measurements of interest for us are carried out 
through one or more periods of the cell cycle in four time 
series — Alpha . C DC 15. and Elutrition from j^, and 
CDC28 from [l3| — with different synchronization proce- 
dures. The total number of experiments in all series is 
73, divided as 18, 24, 14 and 17 microarray experiments 
for each series. 

The missing data in this set are estimated by the proce- 
dure proposed in ■ Essentially, it consists of selecting 
genes with expression profiles similar — in the Euclidean 
distance — to the gene of interest to impute missing val- 
ues. The number of neighboring genes used to estimate 
the missing values is here 15. Finally, we center and nor- 
malize the expression data to have zero mean and unit 
variance. 

We consider a linear, continuous model of the form 



N 



hjXj{t) + fj. 



(1) 



Here Xi{t) denotes the logarithm of the ratio values of 
mRNA of gene i at time t, and N = 6178 denotes the 
number of genes. The coefficient Wij is the effect of gene 
j on gene i and does not depend on time. 

The network is inferred by minimizing the residual sum 
of squares, with an extra constraint on the L^-norm of 
the coefficients (the Lasso). The hyperoctahedronal form 
of the constraint makes it more likely that coefficients 
should become identical zero. Explicitly, it takes the form 
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(2) 



,N. (3) 



Each microarray measurement is here supposed to have 
been performed at time tk and there are K experiments. 
The time derivatives are obtained by spline interpolation 
of the o rigi nal data, where we make use of so-called taut 
splines [14| in order to achieve curves that are faithful to 
the measured data but still do not oscillate too wildly. By 
this procedure, the problem factorize and we can perform 
the minimization for each gene i separately. 

For small enough constraint parameters fii, the solu- 
tion is unique. To choose these parameters, we first solve 
the combined minimization problem of the residual sum 
of squares in ^ together with the minimization of the 
L^-norm of the coefficients. These values, 
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FIG. 1: Distribution of outdegrees for the inferred network. 
This distribution is incompatible with one from a random 
graph. Further, there are 3331 nodes with outdegree zero. 



are used as base-lines against which we measure the size 
of the L^-constraints. Here we utilize the values fii — 

(2) 

0.1/ij . However, varying the coefficient from the value 
0.1 does not result in large changes in the chosen subsets. 
With this choice, the presented solution in this letter is 
unique flst. 

To analyze the topological properties of the network, 
we focus on the adjacent matrix A, obtained from the 
coupling matrix w as 



if Wj, = 

1 otherwise. 



(5) 



Hence we obtain an unweighted digraph. 

For this digraph, we obtain a distribution of indegrees 
varying between unity and eight. This we attribute as an 
artifact of the Lasso procedure, because the sum of the 
modulus of the coefficients is forced not to exceed a spe- 
cific value, and hence it is natural that there is no large 
spread in their number of non-zero values. More inter- 
esting is the distribution of outdegrees, which is depicted 
in Fig. n The distribution is very skew, totally incom- 

P' ible with a Poisson distribution of a random graph 
. For the present distribution of indegrees, the prob- 
ability that any gene gets an outdegree that exceeds 50 
by accident is less than lO^'*" given that the edges are 
independently uniformly distributed. 

Our obtained distribution of outdegrees does not fol- 
low a power-law for more than one decade, i.e., it is not 
scale- free; a property that many other biological networks 
seem to have [mj. Still, the distribution is broad, and we 
expect it to be robust as scale-free networks have been 
proven to be — both topologically and dynamically 
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FIG. 2: Participation ratios for the normalized eigenvectors 
(upper curves) and density of eigenvalues (lower curves) ver- 
sus eigenvalues of the transpose of the matrix T from JSJ. 
The results of the rewired networks, dashed curves, are the 
variation of one standard deviation around the mean of 20 
randomized networks where the degree distribution has been 
kept constant and thus comprise a null hypothesis. The eigen- 
values equal to unity, as well as the zero eigenvalues, have 
been discarded here. The results are binned for clarity into 
bins of size 0.05. There is a more modular structure in the 
yeast network than could be expected from a random system. 



We also search for possible modules in the obtained 
network. To perform this search, we employ the for- 
malism in |2o| . where the authors unraveled modules in 
the internet. Here we explore the network in undirected 
form, i.e., we study the corresponding undirected graph 
obtained by making the adjacent matrix ij^l symmetric. 
The participation ratio of each (normalized) eigenvector 
to the transpose of the matrix 



T ■ — 



if nodes i and j not are adjacent 







1 / kj otherwise. 



(6) 



where kj is the degree of node j, gives an estimate of 
the size of the corresponding module. The eigenvalue 
itself corresponds to how tightly connected the module 
is. In Fig. 121 we see how the participation ratios vary 
with the eigenvalues and also the density of eigenvalues. 
As a null hypothesis, we depict the variation within one 
standard deviation for values obtained from randomized 
networks with the same degree distribution as the actual 
yeast network, as described in [ij- We see in the figure 
how there are some modules in the yeast network, by 
the fact that there are quite high participation numbers, 
compared with the randomized networks, for eigenvalues 
around 0.8. A closer analysis of these modules will be 
published elsewhere. 

To study the biological relevance of the inferred net- 
work, we return to the distribution of outdegrees in 



Fig. n The gene RRN5 has the highest outdegree, 149. 
According to the Saccharomyces Genome Database, SGD 
[2^ , it is involved in transcription of rDNA by RN A poly- 
merase I. A systematic deletion gives an inviable organ- 
ism. The genes YHL026C and YJR079W have the sec- 
ond and third highest outdegrees, 53 and 51, respectively. 
According to SGD, the organism is still viable after a 
systematic deletion of each. The molecular functions of 
the genes are unknown, as are the biological processes in 
which they are involved. 

It does not seem too unrealistic to associate the nodes 
with high outdegrees with transcription factors, although 
the edges in the obtained network must not be inter- 
preted as biochemical interactions only. However, a pre- 
vious study based on 273 single gene-deletion experi- 
ments for the same kind of y east as we have did not 
show any such correlation [23. Still, though, the con- 
jecture makes sense, and we investigate if the (known) 
transcription factors of yeast are overrepresented among 
the genes with highest outdegrees. In order to do so, we 
exploit the procedure proposed in |23 |. 

We rank the genes according to their outdegree, giving 
the highest rank (i.e, rank number one) to the gene with 
the highest outdegree. From the GO-database [l^ we 
obtain a classification of each gene whether it codes for 
a transcription factor or not (or if it is unknown). From 
these data, we form the cumulative excess of genes which 
code for transcription factors. 



A. = #{TF < r} 



#{TF} 

M ' 



(7) 



as a function of rank r. All genes are ranked, so r = 
1, ■ • ■ ,N. Here #{TF < r} is the number of genes known 
to code for transcription factors and rir is the number of 
classified genes, both in the set of genes with rank < r, 
#{TF} — 308 is the total number of genes known to 
code for transcription factors, and finally M = 3294 is 
the total number of classified genes. 

The number we subtract is the expected number of 
genes coding for transcription factors under the null hy- 
pothesis that they are uniformly distributed in outdegree 
ranks. In Fig. Owe show A^, the cumulative excess, as 
function of rank. The slope of the curve corresponds to 
the excess of transcri ptio n factors, i.e., the deviation from 
the null hypothesis. I2g| 

The curve in Fig.^shows a clear excess of genes coding 
for transcription factors among the nodes with high out- 
degrees, between 400 and 2000, approximately. To see 
this, we depict in the figure also the curves correspond- 
ing to plus and minus one standard deviation under the 
null hypothesis (dashed curves). The ratio between the 
observed deviation and the standard deviation translate 
into standard Z-scores. We have a signal of 4.8 standard 
deviations for the first 737 genes, 3.2 standard deviations 
for the first 1000 genes and 4.6 standard deviations for 
the first 2000 genes. Hence, we claim that the obtained 
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FIG. 3; Cumulated excess of genes coding for transcription 
factors, ranked according to their outdegrees. The dashed 
curves correspond to one standard deviation under the nuU 
hypothesis that the transcription factors are uniformly dis- 
tributed among the classified genes. The straight lines corre- 
spond to sets of genes with the same outdegree, and whose 
order within the set thereby is arbitrary. 



distribution of genes is very far from accidental. 

In summa, we have presented the application of a spe- 
cific inference procedure to the reverse engineering prob- 
lem of an effective regulatory network from temporal 
data. By studying the simplified network where we have 
discarded the weights of the links, we find a distribution 
of outdegrees which is broad, as well as modules in the 
network. The existence of nodes with high outdegrees by 
chance is improbable. A closer look shows that the most 
connected node in the network represents a gene which is 
involved in transcription. Especially, we also find a clear 
excess of genes coding for transcription factors among 
the genes with high outdegrees. Given the simplicity of 
our approach, utilizing only linear models and transcript 
data, the method works reasonably well. 
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